Exploring mutexes, the Oracle® RDBMS retrial spinlocks




Nikolaev A. S. 

Andrey.Nikolaev@rdtex.ru, http://andreynikolaev.wordpress.com

RDTEX LTD, Protvino, Russia 






Proceedings of International Conference on Informatics MEDIAS2012. 
Cyprus, Limassol, May 7-14, 2012. ISBN 978-5-88835-023-2 

Spinlocks are widely used in database engines for process synchronization. KGX mutexes are new retrial spinlocks that appeared in contemporary Oracle® versions for submicrosecond synchronization. Mutex contention is frequently observed in highly concurrent OLTP environments.

This work explores how Oracle mutexes operate, spin, and sleep. It develops a predictive mathematical model and discusses parameters and statistics related to mutex performance tuning, as well as results of contention experiments.




I. Introduction 

According to Oracle® documentation [1], a mutex is:
"A mutual exclusion object ... that prevents an object in memory from aging out or from being corrupted ..."
A huge Oracle RDBMS instance contains thousands of processes accessing shared memory. This shared memory, named the "System Global Area" (SGA), consists of millions of cache, metadata and results structures. Simultaneous access to these structures is synchronized by Locks, Latches and KGX Mutexes.

Latches and mutexes are the Oracle realizations of the spinlock concept. My previous article [30] explored latches, the traditional Oracle spinlocks known since the 1980s. A process requesting a latch spins for 20000 cycles polling its location. If unsuccessful, the process joins a queue and uses wait-posting to be awakened on latch release.

The goal of this work is to explore the newest Oracle spinlocks, the mutexes. Mutexes were introduced in 2006 for synchronization inside the Oracle Library Cache. Table 1 compares Oracle internal synchronization mechanisms.

Wikipedia defines the spinlock as "... a lock 
where the thread simply waits in a loop ("spins") re- 
peatedly checking until the lock becomes available. As 
the thread remains active but isn't performing a useful 
task, the use of such a lock is a kind of busy waiting". 

The use of spinlocks for multiprocessor synchronization was first introduced by Edsger Dijkstra in [2]. Since that time, mutual exclusion algorithms have advanced significantly. Various sophisticated spinlock realizations (TS, TTS, Delay, MCS, Anderson, etc.) were proposed and evaluated. A contemporary review of these algorithms may be found in [3].

Two general spinlock types exist: 

System spinlocks that protect critical OS structures. The kernel thread cannot wait or yield the CPU; it must loop until success. Most mathematical models explore this spinlock type. The major metrics to optimize for system spinlocks are the frequency of atomic operations (or Remote Memory References) and shared bus utilization.

User application spinlocks, like Oracle latches and mutexes, that protect user-level structures. It is more efficient to poll the mutex for several microseconds than to pre-empt the thread with a 1 millisecond context switch. The metrics to optimize are spinlock acquisition CPU and elapsed times.

In Oracle versions 10.2 to 11.2.0.1 (or, equivalently, 
from 2006 to 2010) the mutex spun using atomic "test- 
and-set" operations. According to Anderson classifi- 
cation [4] it was a TS spinlock. Such spinlocks may 
induce the Shared Bus saturation and affect perfor- 
mance of other memory operations. 

Since version 11.2.0.2, processes poll the mutex location nonatomically and use atomic instructions only to finally acquire it. The contemporary mutex became a TTS ("test-and-test-and-set") spinlock.

System spinlocks frequently use more complex structures than TTS. Such algorithms, like the famous MCS spinlocks [5], were optimized for 100% utilization. For the current state of system spinlock theory see [5].

If a user spinlock is held for a long time, for example due to OS preemption, pure spinning becomes ineffective and wastes CPU. To overcome this, after 255 spin cycles the mutex sleeps, yielding the processor to other workloads, and then retries. The sleep timeouts are determined by the mutex wait scheme.

Such spin-sleeping was first introduced in [7] to achieve a balance between CPU time lost by spinning and context switch overhead.

From the queuing theory point of view, such systems with repeated attempts are retrial queues. More precisely, in a retrial system a request that finds the server busy upon arrival leaves the service area and joins a retrial group (orbit). After some time this request will have a chance to try its luck again. There exists an extensive literature on retrial queues; see [8], [9] and references therein.

The mutex retrial spin-sleeping algorithm differs significantly from the FIFO spin-blocking used by Oracle latches [30]. Spin-blocking was explored in [10], [11], [12]. Its robustness in contemporary environments was recently investigated in [13].

Historically, mutex contention issues were hard to diagnose and resolve [55]. Mutexes are much less documented than needed and evolve rapidly. Support engineers definitely need more mainstream science support to predict the results of changing mutex parameters. This paper summarizes the author's work on the subject. Additional details may be found in my blog [29].

II. Oracle® RDBMS Performance Tuning overview

Table 1. Serialization mechanisms in Oracle

Before discussing mutexes, we need some introduction. During the last 33 years, Oracle evolved from the first one-user SQL database to the most advanced contemporary RDBMS engine. Each version introduced performance and concurrency advances:

v. 2 (1979): the first commercial SQL RDBMS. 

v. 3 (1983): the first database to support SMP. 

v. 4 (1984): read-consistency. Database Buffer Cache. 

v. 5 (1986): Client-Server, Clustering, Distributed Database, 
SGA. 

v. 6 (1988): procedural language (PL/SQL), undo/redo, 
latches. 

v. 7 (1992): Library Cache, Shared SQL, Stored proce- 
dures, 64bit. 

v. 8/8i (1999): Object types, Java, XML. 

v. 9i (2000): Dynamic SGA, Real Application Clusters. 

v. 10g (2003): Enterprise Grid Computing, Self-Tuning,
mutexes.

v. 11g (2008): Results Cache, SQL Plan Management,
Exadata.

v. 11gR2 (2011): ... Mutex wait schemes. Hot object copies.

v. 12c (2012): Cloud ...

As of now, Oracle is the most widely used SQL RDBMS. In the majority of workloads it works perfectly. However, a quick search on Amazon finds more than 100 books devoted to Oracle performance tuning [TOl fTTl [T5]. Dozens of conferences cover this topic every year. Why does Oracle need such tuning?

The main reason is complex and variable workloads. Oracle works in very different environments, ranging from huge OLTP systems and petabyte OLAP systems to hundreds of multi-tenant databases running on one server. Every high-end database is unique.

To be able to work in such diverse conditions, Oracle RDBMS has complex internals. To get the most out of the hardware we need precise tuning. Working in Support, I cannot overestimate the importance of educating developers and database administrators in this field.

In order to diagnose performance problems, Oracle instrumented its software. Every Oracle session keeps many statistics counters describing "what was done". Oracle Wait Interface (OWI) [18] events describe "why the session waits" and complement the statistics.

Fig. 2. Oracle mutex workflow.

Statistics, OWI and data from internal ("fixed") X$ tables are used by Oracle diagnostics and visualization tools.

This is the traditional framework of Oracle perfor- 
mance tuning. However, it was not effective enough in 
spinlocks troubleshooting. 

The DTrace. 

To observe how mutexes work and to catch short-duration events, we need something like a stroboscope in physics. Luckily, such a tool exists in Oracle Solaris™. This is DTrace, the Solaris 10 Dynamic Tracing framework [20].

DTrace is an event-driven, kernel-based instrumentation that can see and measure all OS activity. It defines probes to trap and handlers (actions) written in a dynamically interpreted C-like language. No application changes are needed to use DTrace. This is very similar to triggers in database technologies.

DTrace can: 

— Catch any event inside Solaris and any function call inside Oracle.

— Read and change any address location in-flight.

— Count the mutex spins, trace the mutex waits, perform experiments.

— Measure times and distributions with up to microsecond precision.

Unlike standard tracing tools, DTrace works in the Solaris kernel. When a process enters a probe function, execution passes to the Solaris kernel and DTrace fills its buffers with the data. Kernel-based tracing is more stable and has less overhead than userland tracing. DTrace sees all system activity and can account for the time associated with kernel calls, scheduling, etc.

In the following sections, the description of Oracle performance tuning is interleaved with mathematical estimations.

III. Mutex spin model 

The Oracle mutex workflow is schematically visualized in Fig. 2. The Oracle process:

— Uses an atomic hardware instruction for the mutex Get.

— If missed, the process spins by polling the mutex location during the spin get.

— The number of spin cycles is bounded by the spin count.

— If the spin get does not succeed, the process acquiring the mutex sleeps.

— During the sleep the process may wait for an already free mutex.

Oracle counts Gets and Sleeps, and we can measure Utilization.

This section introduces the mathematical model used to forecast mutex behaviour. It extends the model used in [10], [30] to a general holding time distribution and TTS spinlock concurrency.

Consider a general stream of mutex holding events. The mutex memory location is changed by sessions at times $H_k$, $k \in \mathcal{N}$, using an atomic instruction. This instruction blocks the shared bus and succeeds only when the memory location is free.

After acquisition the session holds the mutex for time $x_k$ distributed with p.d.f. $p(t)$. I assume that the incoming stream is Poisson with rate $\lambda$ and that $H_k$ (and $x_k$) are generally independent, forming a renewal process. Furthermore, I assume here the existence of at least second moments for all the distributions.

The mutex acquisition request at time $\tau_m$, $m \in \mathcal{N}$, succeeds immediately if it finds the mutex free. Due to the Serve-In-Random-Order nature of spinlocks, there is no simple relation between $m$ and $k$.

If the mutex was busy at time $\tau_m$:

$$H_k < \tau_m < H_k + x_k \quad \text{for some } k,$$

then a miss occurred. According to the PASTA property the miss probability is equal to the mutex utilization $\rho$.

The missing session will spin polling the mutex location up to time $\Delta$ determined by the _mutex_spin_count parameter. The initial "contention free" approximation assumes that no other requests arrive during the spin ($\lambda\Delta \ll 1$) and that the session acquires the mutex if it becomes free while spinning. Therefore the spin succeeds if:

$$\tau_m + \Delta > H_k + x_k.$$

If the mutex was not released during $\Delta$, the session sleeps.

According to classic considerations of renewal theory [22], [25], incoming requests pick up the holding intervals with p.d.f.:

$$p_h(x) = \frac{1}{S}\, x\, p(x), \tag{1}$$

and observe the transformed mutex holding time distribution. Here $S = E(x)$ is the average mutex holding time. The p.d.f. and average of the residual holding time are well known:

$$p_r(x) = \frac{1}{S}\int_x^\infty p(t)\,dt, \qquad S_r = \frac{E(x^2)}{2S}. \tag{2}$$

The spin time distribution (conditioned on a miss) follows the c.d.f. $P_r$, but has a discontinuity [30] at $t = \Delta$ because the session acquiring the mutex never spins more than $\Delta$. The magnitude of this discontinuity is the overall probability that the residual mutex holding time is greater than $\Delta$. The corresponding p.d.f. is:

$$p_{sg}(x) = p_r(x)\,H(\Delta - x) + Q_r(\Delta)\,\delta(x - \Delta). \tag{3}$$

Here $H(x)$ and $\delta(x)$ are the Heaviside step and delta ("bump") functions, respectively.



The spinlock observables 

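Oracle statistics allow measuring the spin inefficiency (or sleep ratio) coefficient $k$. This is the probability of not acquiring the mutex during the spin. Another crucial quantity is $\Gamma$, the average CPU time spent while spinning for the mutex:

$$\begin{aligned}
k_0 &= Q_r(\Delta) = \int_\Delta^\infty p_r(t)\,dt = \frac{1}{S}\int_\Delta^\infty Q(t)\,dt,\\
\Gamma_0 &= \int_0^\infty t\,p_{sg}(t)\,dt = \frac{1}{S}\int_0^\Delta dt \int_t^\infty Q(z)\,dz.
\end{aligned} \tag{4}$$

Here $Q(t) = \int_t^\infty p(z)\,dz$ is the probability that the holding time exceeds $t$, and the subscript 0 denotes the "contention free" approximation. Using (2), for distributions with finite dispersion these expressions can be rewritten in two ways [30].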
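As a numerical sanity check of (4), the following minimal sketch integrates the two expressions for an exponentially distributed holding time, for which the closed forms $k_0 = e^{-\Delta/S}$ and $\Gamma_0 = S(1 - e^{-\Delta/S})$ follow directly. The function names, the midpoint integration rule and the numbers are assumptions made for illustration; they are not taken from the paper.

```python
import math

def residual_survival(Q, S, t, upper, n):
    """Q_r(t) = (1/S) * integral from t to `upper` of Q(z) dz (midpoint rule)."""
    h = (upper - t) / n
    return sum(Q(t + (i + 0.5) * h) for i in range(n)) * h / S

def spin_observables(Q, S, delta, upper, n_outer=200, n_inner=2000):
    """Contention-free k_0 and Gamma_0 of eq. (4) for a survival function Q."""
    k0 = residual_survival(Q, S, delta, upper, n_inner)
    h = delta / n_outer
    # Gamma_0 = integral_0^delta Q_r(t) dt, again by the midpoint rule.
    gamma0 = sum(
        residual_survival(Q, S, (i + 0.5) * h, upper, n_inner)
        for i in range(n_outer)
    ) * h
    return k0, gamma0

# Illustrative numbers only: exponential holding time with mean S = 1,
# spin time delta = 2 (same arbitrary time unit), integration cut off at 40.
S, delta = 1.0, 2.0
k0, gamma0 = spin_observables(lambda t: math.exp(-t / S), S, delta, upper=40.0)
print(k0, math.exp(-delta / S))                 # numeric vs. closed form
print(gamma0, S * (1 - math.exp(-delta / S)))   # numeric vs. closed form
```

Any other survival function with a finite second moment can be passed as `Q`, provided the cutoff `upper` is chosen large enough for the tail to be negligible.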

Low spin efficiency region 

The first form is suitable for the region of low spin efficiency $\Delta \ll S$:

From the above expressions it is clear that the spin probes the mutex holding time distribution around the origin.

Other parts of the mutex holding time p.d.f. impact spin efficiency and CPU consumption only through the average holding time $S$. This allows us to estimate how these quantities depend upon a change of _mutex_spin_count (or $\Delta$). If the process never releases the mutex immediately ($p(0) = 0$), then

$$\begin{aligned}
k_0 &= 1 - \frac{\Delta}{S} + O(\Delta^3),\\
\Gamma_0 &= \Delta - \frac{\Delta^2}{2S} + O(\Delta^4).
\end{aligned}$$

For Oracle performance tuning purposes we need to know what will happen if we double $\Delta$:

In the low efficiency region, doubling the spin count will double the number of efficient spins and also double the CPU consumption.
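The doubling rule can be read off the leading terms of the low-efficiency expansions. The following worked step is a sketch under the assumptions already stated in the text ($p(0) = 0$ and $\Delta \ll S$, so that the $O(\Delta^2)$ and higher terms are dropped); here $1 - k_0$ is the fraction of spins that succeed.

```latex
\begin{aligned}
1 - k_0(\Delta) &\approx \frac{\Delta}{S}, &
1 - k_0(2\Delta) &\approx \frac{2\Delta}{S} = 2\,\bigl(1 - k_0(\Delta)\bigr),\\
\Gamma_0(\Delta) &\approx \Delta, &
\Gamma_0(2\Delta) &\approx 2\Delta = 2\,\Gamma_0(\Delta).
\end{aligned}
```

Both the fraction of successful spins and the CPU time per spin scale linearly with $\Delta$ in this region, so the CPU cost per successful spin stays roughly constant.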

High spin efficiency region 

In the high efficiency region the sleep cuts off the tail of the spinlock holding time distribution:

$$\begin{aligned}
k_0 &= \frac{1}{S}\int_\Delta^\infty (t - \Delta)\,p(t)\,dt,\\
\Gamma_0 &= \frac{E(t^2)}{2S} - \frac{1}{2S}\int_\Delta^\infty (t - \Delta)^2\,p(t)\,dt = S_r - T_r.
\end{aligned} \tag{6}$$

Here $T_r$ is the residual after-spin holding time. This quantity will be used later.

Fig. 3. The concurrency formfactor $F_c(x)$.

Oracle normally operates in this region of small sleep ratio. Here the spin count is greater than the number of instructions protected by the mutex, $\Delta \gg S$. The spin time is bounded by both the "residual holding time" and the spin count:

$$\Gamma < \min(S_r, \Delta).$$

The sleep prevents the process from wasting CPU by spinning on the heavy tail of the mutex holding time distribution.

Concurrency model 

In the real world several processes may spin on different processors concurrently. After the mutex release all these sessions will issue atomic Test-and-Set instructions to acquire the mutex. Only one instruction succeeds. What should the other sessions do? This is the principal question for hybrid TTS spinlocks.

The session may either continue to spin up to _mutex_spin_count or proceed to sleep immediately. The sleep seems reasonable because the session knows that the spinlock has just become busy.

For further estimations I will use the second scenario. After the mutex release only one spinning session acquires it according to the SIRO discipline, while all other sessions sleep. This is an interesting queuing discipline that, to my knowledge, has not been explored in the literature. Its C pseudocode looks like:

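    while (1) {
        i = 0;
        /* poll the lock non-atomically for at most spin_count cycles */
        while (lock != 0 && i < spin_count) i = i + 1;
        /* one atomic attempt; on failure sleep and retry */
        if (Test_and_Set(lock)) return SUCCESS;
        Sleep();
    }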
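The pseudocode above can be sketched in runnable form. This is only an illustration of the control flow, not Oracle's implementation: a Python `threading.Lock` stands in for the mutex memory location, `lock.locked()` plays the role of the non-atomic poll, and a non-blocking `acquire` plays the role of the atomic Test-and-Set. The function name, the `max_rounds` safety limit and the default values are assumptions.

```python
import threading
import time

def spin_sleep_acquire(lock, spin_count=255, sleep_s=0.01, max_rounds=1000):
    """Poll, then make one atomic attempt; sleep and retry on failure."""
    for _ in range(max_rounds):
        i = 0
        # Polling phase ("test"): only reads the lock state.
        while lock.locked() and i < spin_count:
            i += 1
        # Single atomic attempt ("test-and-set"): never blocks.
        if lock.acquire(blocking=False):
            return True
        # The lock is busy (still held, or taken by a competitor): sleep.
        time.sleep(sleep_s)
    return False

lock = threading.Lock()
assert spin_sleep_acquire(lock)                # free lock: acquired at once
threading.Timer(0.05, lock.release).start()    # holder releases after 50 ms
print(spin_sleep_acquire(lock))                # succeeds after a few sleeps
```

Because the sleeping session does not queue, whichever session polls first after the release wins, which mirrors the Serve-In-Random-Order behaviour discussed in the text.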

The time just after the mutex release is a Markov regeneration point. All the spinning behavior after this time is independent of previous history.

Consider a mutex holding interval of length $x$ containing at least one (tagged) incoming request for the mutex from a Poisson stream with rate $\lambda$. The conditional probability that this interval contains exactly $n$ incoming requests is:

$$P_n = \frac{1}{1 - e^{-\lambda x}}\,\frac{(\lambda x)^n}{n!}\,e^{-\lambda x}, \qquad n \ge 1.$$

Under these conditions the session will acquire the mutex with probability $1/n$. The overall probability for the tagged session to acquire the mutex is:

Here $\mathrm{Ein}(z) = \int_0^z \frac{1 - e^{-t}}{t}\,dt$ is the Entire Exponential integral [24].

The concurrency formfactor $F_c(x)$ is a smooth, monotonically decreasing function with asymptotics $F_c(x) = 1 - x/2 + O(x^2)$ around zero and $F_c(x) \approx 1/x + O(1/x^2)$ as $x \to \infty$. It may be efficiently approximated by rational functions [25]. Fig. 3 shows its log-log plot.

Normally the mutex operates in the region $\lambda x \ll 1$ and the value of the formfactor $F_c$ is very close to 1.

According to (1), the probability that a missing request observes a holding interval from $x$ to $x + dx$, that its residual holding time is from $t$ to $t + dt$, $t \le x$, and that it concurrently acquires the mutex on release is:

$$dP = \frac{1}{S}\,p(x)\,F_c(\lambda \min(x, \Delta))\,dx\,dt.$$

Therefore, the spin inefficiency, or the probability of not acquiring the mutex by spinning, is:

$$k = 1 - \frac{1}{S}\int_0^\Delta dt \int_t^\infty p(x)\,F_c(\lambda \min(x, \Delta))\,dx. \tag{8}$$

Changing the order of integration we have:

$$k = 1 - \frac{1}{S}\int_0^\infty \min(x, \Delta)\,p(x)\,F_c(\lambda \min(x, \Delta))\,dx. \tag{9}$$

Comparing with (5), we can outline the contention contribution:

$$k = k_0 + \frac{1}{S}\int_0^\Delta Q(x)\,\frac{d}{dx}\Bigl(x\,\bigl(1 - F_c(\lambda x)\bigr)\Bigr)\,dx. \tag{10}$$

Using the formfactor asymptotics in the low contention region $\lambda\Delta \ll 1$ and Little's law $\rho = \lambda S$, we can estimate how the spin efficiency depends on mutex utilization:

$$k = k_0 + \frac{\rho}{S^2}\int_0^\Delta x\,Q(x)\,dx + O\bigl((\lambda\Delta)^2\bigr).$$

Data of mutex contention experiments (Fig. 4) roughly agree with this linear approximation.

Fig. 4. $k(\rho)$.



IV. How Oracle requests the mutex 

This section discusses Oracle mutex internals beyond the documentation. In order to explore mutexes we need reproducible testcases.

Each time an Oracle session executes an SQL statement, it needs to pin the cursor in the library cache using a mutex. "True" mutex contention arises when the same SQL statement executes concurrently at high frequency. Therefore, the simplest testcase for "Cursor: pin S" contention should look like:

begin
  for i in 1 .. 1000000 loop
    execute immediate 'select 1 from dual where 1=2';
  end loop;
end;
/

The script uses a PL/SQL loop to execute a fast SQL statement one million times. Pure "Cursor: pin S" mutex contention arises when I execute this script in several concurrent sessions.

It is worth noting that the session_cached_cursors parameter value must be nonzero to avoid soft parses. Otherwise, we will also see contention for "Library Cache" and "hash table" mutexes. Indeed, it is enough to disable the session cursor cache and add a dozen versions of the SQL to induce "Cursor: mutex S" contention.

Similarly, "Library cache: mutex X" contention arises when an anonymous PL/SQL block executes concurrently at high frequency:

begin
  for i in 1 .. 1000000 loop
    execute immediate 'begin demo_proc(); end;';
  end loop;
end;
/

Many other mutex contention scenarios are possible. See blog [29].

Table 2 describes the types of mutexes in contemporary Oracle. The "Cursor Pin" mutexes act as pin counters for library cache objects (e.g. child cursors) to prevent their aging out of the shared pool. "Library cache" cursor and bucket mutexes protect KGL locks and static library cache hash structures. The "Cursor Parent" and "hash table" mutexes protect parent cursors during parsing and reloading.

The mutex address can be obtained from the x$mutex_sleep_history Oracle table. Such "fixed" tables externalize internal Oracle structures to SQL. Due to the dynamic nature of mutexes, Oracle does not have any fixed table like v$latch containing data about all mutexes.

According to Oracle documentation, this table is a circular buffer containing data about the latest mutex waits. However, my experiments demonstrated that it is actually a hash array in the SGA. The hash key of this array is likely to depend on the mutex address and the ID of the blocking session. The row for each next sleep for the same mutex and blocking session replaces the row for the previous sleep.

Table 2. Mutex types in Oracle 11.2.0.3

Mutexes in memory

We can examine a mutex using the oradebug peek command. It shows the memory contents:

SQL> oradebug peek 0x3F119B5A8 24
[3F119B5A8, 3F119B5CC) =
00000016 00000001 0000001D 000015D7 382DA701 03
SID      refcnt   gets     sleeps   idn      op

According to Oracle documentation, the mutex structure contains:

— An atomically modified value that consists of two parts:

   Holding SID. The top 4 bytes contain the SID of the session currently holding the mutex exclusively or modifying it. It is session number 0x16 in the above example.

   Reference count. The lower 4 bytes represent the number of sessions currently holding the mutex in Shared mode (or the mutex is in flux).

— GETS: number of times the mutex was requested.

— SLEEPS: number of times sessions slept for the mutex.

— IDN: mutex identifier. Hash value of the library cache object protected by the mutex, or the hash bucket number.

— OP: current mutex operation.
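The field layout just listed can be applied mechanically to the dump line shown earlier. The sketch below is only an illustration: the helper name `decode_mutex_dump` is hypothetical, and the third word of the example dump is read as hexadecimal `0000001D`.

```python
def decode_mutex_dump(words):
    """Map the hex words of one `oradebug peek` data line to named fields."""
    names = ("sid", "refcnt", "gets", "sleeps", "idn", "op")
    values = [int(w, 16) for w in words.split()]
    if len(values) != len(names):
        raise ValueError("expected %d hex words, got %d" % (len(names), len(values)))
    return dict(zip(names, values))

fields = decode_mutex_dump("00000016 00000001 0000001D 000015D7 382DA701 03")
for name, value in fields.items():
    print(name, value, hex(value))
```

For the example line this yields holder SID 0x16 (22), a reference count of 1, and the gets, sleeps, idn and op values in the order given by the column labels of the dump.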

An Oracle session changes the mutex state through a dynamic structure called the Atomic Operation Log (AOL).

Fig. 5. Mutex AOL. [Diagram labels: Sessions fixed array: v$session -> x$ksuse; (session) SID; KKS-UOL used; KGL-UOL SO Cache.]

The AOL contains information about the mutex operation in progress. To operate on a mutex, a session first creates an AOL, fills it with data about the mutex and the desired operation, and calls one of the mutex acquisition routines. Each session has an array of references to the AOLs it is using. Fig. 5 illustrates this. AOLs are also used during mutex recovery if a session crashes.

Mutex modes and states 

A mutex can be held in three modes:

— "Shared" (SHRD in traces) mode allows the mutex to be held by several sessions simultaneously. It allows read (execute) access to the structure protected by the mutex. In shared mode the lower 4 bytes of the mutex value represent the number of sessions holding the mutex. The upper bytes are zero.

— "eXclusive" (EXCL) mode is incompatible with all other modes. Only one session can hold the mutex in exclusive mode. It allows the session exclusive access to the structure protected by the mutex. In X mode the upper bytes of the mutex value are equal to the holder SID. The lower bytes are zero.

— "Examine" (SHRD_EXAM in dumps) mode indicates that the mutex or its protected structure is in transition. In E mode the upper bytes of the mutex value are equal to the holder SID. The lower bytes represent the number of sessions simultaneously holding the mutex in S mode. A session can acquire the mutex in E mode or upgrade it to E mode even if other sessions are holding the mutex in S mode. No other session can change the mutex at that time.

My experiments demonstrated that the mutex state transition diagram looks like an infinite fence containing shared states 0, 1, 2, 3, ... and corresponding examine states 0, 1, 2, 3, ... There are also EXCL and LONG_EXCL states (Fig. 6).




Not all operations are used by each mutex type. The "Cursor Pin" mutex pins the cursor in the Library Cache during parse and execution in an 8-like way:

Fig. 7. "Cursor Pin" mutex state diagram. [Diagram labels: CLOSE; PARSE/EXEC.]

Here the E and S modes effectively act for the "Cursor Pin" mutex as exclusive and free states. The "Library Cache" mutex uses X mode only. The math in this paper targets these mutex types.

The "hash table" mutexes utilize both X and S modes. Such "read-write" spinlocks will be investigated in a separate paper.

Table 3. Mutex waits in Oracle Wait Interface

When a session is waiting for a mutex, it registers the wait in the Oracle Wait Interface [18]. The most frequently observed waits are named "cursor: pin S", "cursor: pin S wait on X" and "library cache: mutex X".

The naming scheme is presented in Table 3. Here mutex is the name of the mutex type.

Experimental setup to explore mutex wait 

Unlike the latch, the details of the mutex wait were not documented by Oracle. We need to explore it using DTrace. To explore the latch in [30], I acquired it directly by calling the kslgetl function. This is not possible for a mutex. However, by changing memory I can make a mutex "busy" artificially. The Oracle oradebug utility allows changing any address inside the SGA:

SQL> oradebug poke <mutex addr> 8 0x100000001
BEFORE: [3A9371338, 3A9371340) =
00000000 00000000
AFTER: [3A9371338, 3A9371340) =
00000001 00000001

This looks exactly as if the session with SID 1 is holding the mutex in E mode. I wrote several scripts that simulate a busy mutex in S, X and E modes. In these scripts one session artificially holds the mutex for 50 s. Another session tries to acquire the mutex and "statically" waits for the "cursor: pin S" event for 49 s. DTrace allowed me to explore how Oracle actually waits for the mutex.
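The 8-byte value written by the poke can be composed from the two halves described for the mutex value (holder SID in the upper 4 bytes, reference count in the lower 4 bytes). The helpers below are hypothetical illustrations of that arithmetic, not part of any Oracle tool.

```python
def mutex_value(holder_sid, refcnt):
    """Pack a holder SID (upper 4 bytes) and a reference count (lower 4 bytes)."""
    if not (0 <= holder_sid < 2**32 and 0 <= refcnt < 2**32):
        raise ValueError("both halves must fit into 4 bytes")
    return (holder_sid << 32) | refcnt

def split_mutex_value(value):
    """Inverse of mutex_value: return (holder_sid, refcnt)."""
    return value >> 32, value & 0xFFFFFFFF

print(hex(mutex_value(1, 1)))          # the value used in the poke example
print(hex(mutex_value(0, 3)))          # three shared holders, no exclusive holder
print(split_mutex_value(0x1600000000)) # SID 0x16 in the upper half, refcnt 0
```

With SID 1 and a reference count of 1 this reproduces the `0x100000001` used in the poke command.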

Original Oracle 10g mutex busy wait

Oracle introduced mutexes in version 10.2.0.2. Running the script against this version I saw that the waiting process consumed one of my CPUs completely. Oracle showed millions of microsecond waits that accounted for 3 seconds out of the actual 49 second wait. The wait trace looks like:

... spin 255 cycles
yield()
spin 255 cycles
yield()
... repeated 1910893 times

The session waiting for the mutex repeatedly spins 255 times polling the mutex location and then issues the yield() OS syscall. This syscall just allows other processes to run.

Oracle 10.2-11.1 counts wait time as the time spent off the CPU waiting for other processes. If the system had free CPU power, Oracle thought it was not waiting at all and the mutex contention was invisible.



Therefore the old version mutex was a "classic" spinlock without sleeps. If the mutex holding time is always small, this algorithm minimizes the elapsed time to acquire the mutex. The spinning session acquires the mutex immediately after its release.

Such spinlocks are vulnerable to variability of the holding time. If sessions hold the mutex for a long time, pure spinning wastes CPU. Spinning sessions can aggressively consume all the CPUs and affect performance through priority inversion and CPU starvation.

Mutex wait with Patch 6904068 

If long "cursor: pin S" waits are consistently observed in Oracle 10.2-11.1, then the system does not have enough spare CPU for busy waiting. For such a case, Oracle provides the possibility to convert the "busy" mutex wait into a "standard" sleep. This enhancement was named "Patch 6904068: High CPU usage when there are "cursor: pin S" waits". With this patch the mutex wait trace becomes:

... spin 255 times
semsys() timeout=10 ms
... repeated 4748 times

The semtimedop() call is a "normal" OS sleep. The patch significantly decreases CPU consumption by spinning. Its drawback is a larger elapsed time to obtain the mutex. Ten milliseconds is a long wait on the Oracle timescale.
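The repetition counts in the two traces can be turned into per-cycle figures with simple arithmetic. The sketch below uses only numbers quoted in the text (a 49 s wait, 1910893 spin-and-yield cycles, 4748 sleeps of 10 ms); the variable names are illustrative.

```python
WAIT_S = 49.0            # duration of the artificial wait in both experiments

busy_cycles = 1910893    # spin-and-yield repetitions in the 10.2.0.2 trace
cycle_us = WAIT_S / busy_cycles * 1e6

patched_sleeps = 4748    # spin-and-sleep repetitions with the patch
slept_s = patched_sleeps * 0.010   # total time inside the 10 ms sleeps

print(round(cycle_us, 1))   # about 25.6 microseconds per spin-and-yield cycle
print(round(slept_s, 2))    # 47.48 seconds of the 49 s spent sleeping
```

In other words, the unpatched session retries the mutex roughly every 26 microseconds while burning CPU, whereas the patched session retries about every 10 ms and spends almost the whole wait off the CPU.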

One can adjust the sleep time with centisecond granularity and even set it to 0 dynamically. In such a case the Oracle instance behaves exactly as it does without the patch. It makes sense to install patch 6904068 in 10.2-11.1 OLTP environments proactively.

IV. Mutex statistics 

Mutex statistics are the tools to diagnose its efficiency. Oracle internally counts the numbers of gets and sleeps for each mutex. However, there is no fixed table containing current statistics. The x$mutex_sleep_history table shows statistics as they were at the time of the last sleep. This is not enough.

Fortunately, Oracle provides us with the x$ksmmem fixed table. It shows the contents of any address inside the SGA. The mutex value, its gets and sleeps can be read directly from Oracle memory. By repeatedly sampling the mutex value we can estimate another key mutex statistic, Utilization $U$. Little's law $U = \lambda S$ allows computing the average mutex holding time $S$.

Unlike latches [30], a mutex does not count its misses and spin gets. The miss ratio $\rho$ should be estimated from the PASTA (Poisson Arrivals See Time Averages) property $\rho \approx U$.

Oracle counts only the first mutex get, but all the secondary sleeps. Therefore, the spin inefficiency coefficient $k$ differs from the experimentally observed sleep ratio $\varkappa = \Delta\text{sleeps}/\Delta\text{misses}$.

Amisscs 

Description                       | Definition                                            | Relations
Mutex requests arrival rate       | $\lambda = \Delta\text{gets}/\Delta\text{time}$       |
Sleeps rate                       | $\omega = \Delta\text{sleeps}/\Delta\text{time}$      | $\omega = \lambda\rho\varkappa$
Miss ratio (PASTA estimation)     | $\rho = \Delta\text{misses}/\Delta\text{gets}$        | $\rho \approx U$
Sleeps ratio                      | $\varkappa = \Delta\text{sleeps}/\Delta\text{misses}$ | $\varkappa = \omega/(\lambda\rho) \approx k/(1 - k\rho)$
Avg. holding time (Little's law)  | $S = U/\lambda$                                       |
Mutex spin inefficiency           | $k = \Delta\text{sleeps}/\Delta\text{spins}$          |

Table 4. Mutex statistics.

If sleeps are much longer than the mutex correlation time, then every spin-and-sleep cycle observes an independent picture. Due to this independence each sleep has equal probability $k\rho$ and one can estimate:



$$\varkappa \approx k + (k\rho)k + (k\rho)^2 k + \ldots = \frac{k}{1 - k\rho}.$$

Table 4 summarizes mutex statistics and their relations. The corresponding script mutex_statistics.sql to measure mutex statistics is available in [29].
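The relations of this section can be combined into a small calculator. This is a sketch, not the author's mutex_statistics.sql script: the function name and the sample inputs are hypothetical, the miss ratio is taken from the PASTA estimate $\rho \approx U$, and the spin inefficiency is obtained by inverting $\varkappa = k/(1 - k\rho)$, which gives $k = \varkappa/(1 + \varkappa\rho)$.

```python
def mutex_statistics(d_gets, d_sleeps, d_time, utilization):
    """Derive mutex statistics from counter deltas over a sampling interval.

    d_gets, d_sleeps -- increments of the GETS and SLEEPS counters
    d_time           -- length of the sampling interval, seconds
    utilization      -- fraction of samples in which the mutex was busy (U)
    """
    lam = d_gets / d_time            # arrival rate of mutex requests
    omega = d_sleeps / d_time        # sleeps rate
    rho = utilization                # PASTA: miss ratio is estimated by U
    hold = utilization / lam         # Little's law U = lam * S
    kappa = omega / (lam * rho)      # observed sleeps per miss
    k = kappa / (1 + kappa * rho)    # inverts kappa = k / (1 - k * rho)
    return {"lambda": lam, "omega": omega, "rho": rho,
            "S": hold, "kappa": kappa, "k": k}

# Hypothetical sample: 10 s interval, 1,000,000 gets, 500 sleeps, U = 5%.
stats = mutex_statistics(1_000_000, 500, 10.0, 0.05)
for name, value in stats.items():
    print(name, value)
```

For a lightly loaded mutex the correction is small, so the observed sleeps ratio and the spin inefficiency nearly coincide; they diverge as the utilization grows.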

The spin time $\Delta$ can be obtained in my experiments by counting the "spin-and-yield" cycles per second. Contemporary Oracle versions can adjust $\Delta$ using the parameter _mutex_spin_count. Therefore, we can compute spin and yield times separately by linear regression.
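One way to carry out such a regression is to model the duration of one spin-and-yield cycle as a yield time plus a per-poll time multiplied by the spin count. The sketch below fits this line by ordinary least squares; the linear model, the helper name and in particular the measurement values are assumptions made for illustration and are not data from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical measurements, not data from the paper: duration of one
# spin-and-yield cycle (microseconds) at several _mutex_spin_count values.
spin_counts = [255, 510, 1020, 2040]
cycle_us = [2.5, 4.3, 7.9, 15.1]

per_spin_us, yield_us = fit_line(spin_counts, cycle_us)
print("time per poll, us:", per_spin_us)
print("spin of 255 polls, us:", per_spin_us * 255)
print("yield, us:", yield_us)
```

The slope gives the cost of one poll and the intercept gives the cost of the yield call, so the spin time for any spin count follows by multiplication.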

Typical no-contention values for spin, yield and mutex holding time $S$ in exclusive mode on some platforms are summarized in Table 5.

           | Library cache | Cursor pin  | spin    | yield
Exadata    | 0.3 - 5 µs    | 0.1 - 2 µs  | 1.8 µs  | 0.7 ms
Sparc T2   | 2.5 - 12 µs   | 3.2 - 11 µs | 8.7 µs  | 9.5 µs

Table 5. Average mutex spin and yield() times.

Compare these microsecond times with default 
mutex sleep of 10 ms duration. Indeed, the mutex sleep 
is 10000 times longer than spin. 

V. "Mean Value Analysis" of mutex retrials

"Mean Value Analysis" (MVA) is an elegant approach for queuing systems invented by M. Reiser et al. [26]. Recent work [27] discussed MVA for retrial queues. Though not directly applicable to the non-Markovian mutex, this approach can be useful for estimations.

The important point of the following approximation is the replacement of the fixed-time mutex sleep by an exponential memoryless distribution. According to PASTA, a request arriving with frequency $\lambda$ finds the mutex busy with probability $\rho$ and goes to the orbit (sleeps) for time $T$ with probability $k\rho$.



The waiting time consists of the spin time and the time spent sleeping in the orbit:

$$W = W_s + W_{orb}. \tag{11}$$

The process acquires the busy mutex during repeated spins. The total spin time is:

$$W_s = \rho\Gamma + (k\rho)\rho\Gamma + (k\rho)^2\rho\Gamma + \ldots = \frac{\rho}{1 - k\rho}\,\Gamma. \tag{12}$$

The request retries from the orbit while the mutex is busy and idle (Fig. 8):

$$W_{orb} = W_b + W_i.$$

In steady state the overall busy mutex wait time is the time needed to serve all requests currently in the system:

$$W_b + W_s = L_{orb} S + L_s S + \rho S_r.$$

Here $S_r$ is the residual mutex holding time. According to Little's law:

$$L_{orb} = \lambda W_{orb}, \quad L_b = \lambda W_b, \quad L_s = \lambda W_s, \quad \rho = \lambda S.$$

Therefore:

$$W_b = \rho\,(S_r + W_{orb}) - (1 - \rho)\,W_s. \tag{13}$$

Flows per second going to and from the orbit should be
balanced. For the exponential sleep approximation:

$$k\lambda\rho + k\,\frac{\lambda W_b}{T} = \frac{\lambda W_{orb}}{T}$$

Here $T$ is the average time to sleep. One can substitute,
in the spirit of MVA, (13) into this expression and
estimate the average wait time spent in orbit as:

$$W_{orb} = \frac{k}{1 - k\rho}\left(\rho (T + S_r) - (1 - \rho) W_s\right)$$

The overall wait time becomes:

$$W = \frac{\rho}{1 - k\rho}\left(\frac{1 - k}{1 - k\rho}\,\Gamma + k (T + T_r)\right) \tag{14}$$

where $T_r$ is the residual after-spin mutex holding time
that already appeared above.

Normally in Oracle 11.2 the spin inefficiency $k \ll 1$,
and the huge sleep time $T \sim 10^4 \times \{\Gamma, \Delta, T_r\}$
dominates in these formulas and limits the mutex wait
performance:

$$W \approx \frac{k\rho}{1 - k\rho}\,(T + T_r)$$

In order to compare this to mutex experimental data it
should be noted that, unlike queuing theory, the Oracle
Wait Interface (OWI) does not treat the first spin as a
part of the wait [18]. The wait time registered by OWI is
$W_o = W - \rho\Gamma$.

Oracle performance tuning uses the "average wait duration"
metric from the AWR report [1] as a contention signature.
This is the OWI waiting time normalized by the number of
OWI waits:

$$\frac{W_o}{k\rho} \approx \frac{1}{1 - k\rho}\,(T + T_r)$$

If this quantity significantly differs from 1 cs, it may be
a sign of abnormality in mutex utilization or holding time.

Of course, the above estimations do not account for OS
scheduling and are not applicable when the number of
active processes exceeds the number of CPUs.
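
The estimates (11)-(14) can be checked numerically. The following is a minimal Python sketch, not part of the original experiments: the function name and all input values are illustrative assumptions, and for simplicity it uses one value for both residual times ($S_r = T_r$). It evaluates the spin and orbit times separately, compares their sum with the closed form (14), and computes the OWI "average wait duration".

```python
def mva_mutex_wait(rho, k, gamma, sleep, residual):
    """Evaluate the MVA estimates of eqs. (11)-(14).

    rho      - mutex utilization (rho = lambda * S)
    k        - spin inefficiency (probability that a spin fails)
    gamma    - average spin time, seconds
    sleep    - average sleep time T, seconds
    residual - residual holding time, seconds; this sketch assumes
               S_r == T_r and uses one value for both
    """
    if not 0 < k * rho < 1:
        raise ValueError("need 0 < k*rho < 1")
    w_spin = rho * gamma / (1 - k * rho)                  # eq. (12)
    w_orbit = k / (1 - k * rho) * (
        rho * (sleep + residual) - (1 - rho) * w_spin)    # orbit time
    w_total = w_spin + w_orbit                            # eq. (11)
    # Closed form of eq. (14); it should agree with w_total.
    w_closed = rho / (1 - k * rho) * (
        (1 - k) / (1 - k * rho) * gamma + k * (sleep + residual))
    w_owi = w_total - rho * gamma     # OWI does not count the first spin
    avg_owi_wait = w_owi / (k * rho)  # "average wait duration"
    return {
        "w_spin": w_spin,
        "w_orbit": w_orbit,
        "w_total": w_total,
        "w_closed": w_closed,
        "w_owi": w_owi,
        "avg_owi_wait": avg_owi_wait,
    }


if __name__ == "__main__":
    # Illustrative inputs only: 30% utilization, 1% failed spins,
    # 2 us spin, 10 ms sleep, 1 us residual holding time.
    result = mva_mutex_wait(rho=0.3, k=0.01, gamma=2e-6,
                            sleep=10e-3, residual=1e-6)
    for name, value in result.items():
        print(f"{name:>13}: {value * 1e6:12.3f} us")
```

With these assumed inputs the average OWI wait comes out close to the 10 ms sleep, as the approximation above suggests.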

VI. 11.2.0.2.2 Mutex waits diversity 

Since April 2011 the latest Oracle versions use a
completely new concept of mutex waits.

My Oracle Support site [15] described this in the note
Patch 10411618 Enhancement to add different "Mutex" wait
schemes. The enhancement allows one of three concurrency
wait schemes and introduces 3 parameters to control the
mutex waits:

_mutex_wait_scheme: which wait scheme to use:
0 - Always YIELD.
1 - Always SLEEP for _mutex_wait_time.
2 - Exponential Backoff up to _mutex_wait_time.

_mutex_spin_count: the number of times to spin. Default
value is 255.

_mutex_wait_time: sleep timeout depending on scheme.
Default is 1.

The note also mentioned that this fix effectively
supersedes the patch 6904068 described above.

The SLEEPS. Mutex wait scheme 1 

In mutex wait scheme 1 a session repeatedly requests
1 ms sleeps:

kgxSharedExamine(...)
yield()
pollsys() timeout=1 ms repeated 25637 times

The _mutex_wait_time parameter controls the sleep timeout
in milliseconds. This scheme differs from patch 6904068 by
one additional spin-and-yield cycle at the beginning and a
smaller timeout.

Performance of this scheme is sensitive to _mutex_wait_time
tuning. Fig. 8 demonstrates that at moderate concurrency
short mutex sleeps perform better and result in higher
throughput. However, such a millisecond sleep will be
rounded to a centisecond on platforms like Solaris, Windows
and the latest HP-UX, because most operating systems
cannot sleep for very short times.

MVA estimation for mutex wait scheme 1 results in:

$$W_1 \approx \frac{k\rho}{1 - k\rho}\,(kT + T_r).$$

The additional spin at the beginning effectively reduces
the wait time $T$, multiplying it by $k \ll 1$. This
increases the performance.



Fig. 8. Mutex wait scheme 1 throughput. (Plot: throughput
in tps versus N threads, _mutex_wait_scheme=1,
1/4 Exadata X2-2.)

Default "Exponential Backoff" scheme 2

Oracle uses scheme 2 by default. This scheme is named
"Exponential backoff" in the documentation. Unlike in
previous versions, the contemporary mutex wait does not
consume CPU. Surprisingly, DTrace shows that there is no
exponential behavior by default. The session repeatedly
sleeps with 1 cs duration:

yield() call repeated 2 times
semsys() timeout=10 ms repeated 4237 times

To reveal the exponential backoff one needs to increase the
_mutex_wait_time parameter:

SQL> alter system set "_mutex_wait_time"=30;

yield() call repeated 2 times
semsys() timeout=10 ms repeated 2 times
semsys() timeout=30 ms repeated 2 times
semsys() timeout=80 ms
semsys() timeout=70 ms
semsys() timeout=160 ms
semsys() timeout=150 ms
semsys() timeout=300 ms repeated 159 times

This closely resembles the Oracle 8i latch acquisition
algorithm [30], [29]. In scheme 2 the _mutex_wait_time
parameter controls the maximum wait time in centiseconds.
Due to this exponentiality, mutex wait scheme 2 is
insensitive to its value. Indeed, only the sleeps after
the fifth unsuccessful spin are affected by this
parameter [29].
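
The effect of the cap can be illustrated with a generic capped-doubling schedule. This is a hedged Python sketch, not a reconstruction of Oracle's internal algorithm: the function `backoff_schedule` and the pure doubling rule are assumptions, and the traced timeouts above (10, 10, 30, 30, 80, 70, ...) follow a different step pattern. It only shows why a cap equal to the base timeout hides the exponential growth, and why a larger cap changes only the later sleeps.

```python
def backoff_schedule(base_ms, cap_ms, attempts):
    """Return sleep timeouts for a capped doubling backoff.

    base_ms  - first sleep timeout in milliseconds
    cap_ms   - maximum timeout (plays the role of _mutex_wait_time,
               converted to milliseconds)
    attempts - number of consecutive sleeps to generate
    """
    timeouts = []
    current = base_ms
    for _ in range(attempts):
        timeouts.append(min(current, cap_ms))  # never exceed the cap
        current *= 2                           # generic doubling step
    return timeouts


if __name__ == "__main__":
    # Cap equal to the base: every sleep is 10 ms, no visible growth.
    print(backoff_schedule(10, 10, 5))
    # Cap raised to 300 ms: only the later sleeps differ.
    print(backoff_schedule(10, 300, 7))
```

In this sketch the early sleeps are identical for any cap above the base, which is the sense in which the total wait is only weakly sensitive to the cap.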

The default mutex scheme 2 wait differs from patch 6904068
by two yield() syscalls at the beginning. These two
spin-and-yields change the mutex wait performance
drastically (Fig. 9). They effectively multiply the
centisecond wait time $T$ by $k^2$:

$$W_2 \approx \frac{k\rho}{1 - k\rho}\,(k^2 T + T_r).$$
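
To see how strongly the initial spin-and-yield cycles reduce the estimated wait, the three approximations can be evaluated side by side. The Python sketch below is illustrative only: the helper `scheme_wait` and all numeric inputs are assumptions, and the baseline without initial spin-and-yield cycles simply uses the formula $W \approx k\rho (T + T_r)/(1 - k\rho)$ with a 10 ms sleep.

```python
def scheme_wait(rho, k, sleep, t_r, extra_spins=0):
    """Approximate wait k*rho/(1-k*rho) * (k**extra_spins * T + T_r).

    extra_spins - number of additional spin-and-yield cycles before
                  the first sleep (0 for the baseline, 1 for
                  scheme 1, 2 for scheme 2)
    """
    if not 0 <= k * rho < 1:
        raise ValueError("need 0 <= k*rho < 1")
    return k * rho / (1 - k * rho) * (k ** extra_spins * sleep + t_r)


if __name__ == "__main__":
    # Assumed inputs: 30% utilization, 5% failed spins, 1 us residual.
    rho, k, t_r = 0.3, 0.05, 1e-6
    cases = {
        "baseline, 10 ms sleep": scheme_wait(rho, k, 10e-3, t_r, 0),
        "scheme 1, 1 ms sleep": scheme_wait(rho, k, 1e-3, t_r, 1),
        "scheme 2, 10 ms sleep": scheme_wait(rho, k, 10e-3, t_r, 2),
    }
    for name, wait in cases.items():
        print(f"{name:>22}: {wait * 1e6:10.3f} us")
```

Under these assumed inputs the estimated wait drops by more than two orders of magnitude once the extra cycles are added, because the sleep term is multiplied by $k$ or $k^2$.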



Classic YIELDS. Mutex wait scheme 0

The _mutex_wait_scheme 0 consists mostly of repeating
spin-and-yield cycles:

yield() call repeated 99 times
pollsys() timeout=1 ms
yield() call repeated 99 times
pollsys() timeout=1 ms

Fig. 9. Mutex wait schemes performance.



It differs from the aggressive mutex waits used in previous
Oracle versions by a 1 ms sleep after every 99 yields. This
sleep significantly reduces CPU consumption and increases
robustness. Unfortunately, the previous MVA-style analysis
is not applicable to this wait scheme.

The scheme is very flexible [23]. The sleep duration and
yield frequency are tunable by the
_wait_yield_sleep_time_msecs . . . parameters. One can also
specify different wait modes for standard and high priority
processes. This allows almost any combination of yields and
sleeps, including the 10g and patch 6904068 behaviors.

Comparison of mutex wait schemes 

Fig. 9 compares performance of the "Library Cache" mutex
contention testcase on the Exadata platform for all 3 wait
schemes, as well as for the patch 6904068 and 10g mutex
algorithms.

The figures demonstrate that:

Default scheme 2 is well balanced in all concurrency
regions.

Wait scheme 1 should be used when the system is constrained
by CPU.

Wait scheme 0 has throughput close to 10g in the medium
concurrency region and may be recommended when there is
plenty of free CPU.

The 10g mutex algorithm had the fastest performance in
medium concurrency workloads. However, its throughput fell
when the number of contending threads exceeded the number
of CPU cores. CPU consumption increased rapidly beyond this
point. This excessive CPU consumption starves processors
and impacts other database workloads.

Patch 6904068 results in very low CPU consumption, but the
largest elapsed time and the worst throughput.

VII. Mutex Contention

Mutex contention occurs when the mutex is requested by
several sessions at the same time. When diagnosing mutex
contention we should always remember Little's law:

$$U = \lambda S$$

Therefore, the contention can be a consequence of either:

Long mutex holding time S due to, for example, a high SQL
version count, bugs causing long mutex holding time, or CPU
starvation and preemption issues.

Or it may be due to high exclusive mutex utilization.
Mutexes may be overutilized by a too high SQL and PL/SQL
execution rate or by bugs causing excessive requests.
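
As a quick numeric illustration of Little's law, the Python sketch below computes the utilization from a request rate and a holding time. The function name and the sample numbers are assumptions chosen for illustration; they are not measurements from the experiments in this paper.

```python
def mutex_utilization(request_rate, holding_time):
    """Little's law U = lambda * S.

    request_rate - exclusive mutex requests per second (lambda)
    holding_time - average mutex holding time in seconds (S)
    """
    return request_rate * holding_time


if __name__ == "__main__":
    # Assumed workload: 100,000 requests/s holding the mutex for 2 us.
    print(mutex_utilization(100_000, 2e-6))
    # The same utilization increase results from doubling either factor.
    print(mutex_utilization(200_000, 2e-6))
    print(mutex_utilization(100_000, 4e-6))
```

The last two lines show the two routes to contention named above: a higher request rate or a longer holding time.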

Mutex statistics help to diagnose what actually 
happens. 

The latest Oracle versions include many fixes for
mutex-related bugs and allow flexible wait schemes,
adjustment of the spin count and cloning of hot library
cache objects. My blog [52] continuously discusses related
enhancements.

Traditionally, tuning of mutex performance problems focused
on changing the application and reducing the mutex demand.
To achieve this one needs to tune the SQL operators, change
the physical schema, raise a bug with Oracle Support, etc.
[16], [17], [18].



However, such tuning may be too expensive and may even
require a complete application rewrite. This article
discusses one tuning possibility that is not widely used:
changing the mutex spin count. This was commonly treated as
old-style tuning that should be avoided by any means. The
public opinion is that increasing the spin count leads to a
waste of CPU. However, nowadays CPU power is cheap, and we
may already have enough free resources. It makes sense to
know when spin count tuning may be beneficial.

Mutex spin count tuning 

A long mutex holding time may cause mutex contention. The
default _mutex_spin_count=255 may be too small. Longer
spinning may alleviate this. If the mutex holding time
distribution has an exponential tail:

$$Q(t) \sim C\exp(-t/\tau)$$

$$k \sim C\exp(-t/\tau)$$

$$\Gamma \approx S_r - C\tau\exp(-t/\tau)$$

Fig. 10. Spin count tuning.

It is easy to see that if the "sleep ratio" is small enough
($k \ll 1$), then doubling the spin count will square the
"sleep ratio" and will only add a part of order $k$ to the
spin CPU consumption.
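
This scaling rule can be checked directly from the exponential-tail formulas. The Python sketch below is an illustration under stated assumptions: $t$ is read as the spin duration, the function names are hypothetical, and the constants ($C = 1$, $\tau = S_r = 1$ µs, 3 µs spin) are chosen only to make the arithmetic easy to follow.

```python
import math


def sleep_ratio(spin_time, c, tau):
    """k ~ C * exp(-t / tau): probability that the spin fails."""
    return c * math.exp(-spin_time / tau)


def mean_spin_time(spin_time, c, tau, s_r):
    """Gamma ~ S_r - C * tau * exp(-t / tau): average time spent spinning."""
    return s_r - c * tau * math.exp(-spin_time / tau)


if __name__ == "__main__":
    c, tau, s_r = 1.0, 1e-6, 1e-6   # assumed tail parameters
    t = 3e-6                        # assumed current spin duration

    k_before = sleep_ratio(t, c, tau)
    k_after = sleep_ratio(2 * t, c, tau)   # spin count doubled
    extra_spin = (mean_spin_time(2 * t, c, tau, s_r)
                  - mean_spin_time(t, c, tau, s_r))

    print(f"sleep ratio before: {k_before:.5f}")
    print(f"sleep ratio after:  {k_after:.5f}")
    print(f"k_before**2 / C:    {k_before ** 2 / c:.5f}")
    print(f"extra spin time:    {extra_spin * 1e6:.5f} us")
    print(f"tau * k_before:     {tau * k_before * 1e6:.5f} us")
```

For a general constant $C$ the doubled-spin sleep ratio equals $k^2 / C$, and the extra spin time is $\tau (k - k^2/C)$, which is of order $k$.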

In other words, if the spin is already efficient, it is
worth increasing the spin count. Fig. 10 demonstrates the
effect of spin count adjustment for the "Library Cache"
mutex contention testcase.

The spin count tuning is very effective. Elapsed time fell
rapidly while CPU consumption increased smoothly. The
number of mutex waits demonstrates almost linear behavior
in log scale. This confirms the scaling rule.

Conclusions 

This work investigated the possibilities to diagnose and
tune mutexes, the retrial Oracle spinlocks. Using DTrace,
it explored how the mutex works, its spin-waiting schemes,
and the corresponding parameters and statistics. A
mathematical model was developed to predict the effect of
mutex tuning.

The results are important for performance tuning of highly
loaded Oracle OLTP databases.